
    Effect of missing data on multitask prediction methods

    There has been a growing interest in multitask prediction in chemoinformatics, helped by the increasing use of deep neural networks in this field. This technique is applied to multitarget data sets, where compounds have been tested against different targets, with the aim of developing models to predict a profile of biological activities for a given compound. However, multitarget data sets tend to be sparse; i.e., not all compound-target combinations have experimental values. There has been little research on the effect of missing data on the performance of multitask methods. We have used two complete data sets to simulate sparseness by removing data from the training set. Different models to remove the data were compared. These sparse sets were used to train two different multitask methods: deep neural networks and Macau, a Bayesian probabilistic matrix factorization technique. Results from both methods were remarkably similar and showed that the performance decrease caused by missing data is small at first and accelerates only after large amounts of data are removed. This work provides a first approximation to assess how much data is required to produce good performance in multitask prediction exercises.
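
The sparseness simulation described in this abstract can be illustrated with a minimal sketch. It assumes the simplest possible removal model (entries dropped uniformly at random); the abstract says several removal models were compared but does not name them, so this is not necessarily one of the authors' models. The function name `remove_at_random` and the toy 20 x 5 matrix are hypothetical.

```python
import random


def remove_at_random(matrix, fraction, seed=0):
    """Return a copy of `matrix` with `fraction` of its cells set to None.

    Assumption: cells are removed uniformly at random, which is only one
    of many possible removal models.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    n_remove = round(fraction * len(cells))
    removed = set(rng.sample(cells, n_remove))
    return [
        [None if (i, j) in removed else matrix[i][j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Hypothetical complete compound x target matrix (20 compounds, 5 targets).
complete = [[float(i + j) for j in range(5)] for i in range(20)]

# Remove 80% of the training values to simulate a sparse data set.
sparse = remove_at_random(complete, 0.8)
n_missing = sum(v is None for row in sparse for v in row)
```

Repeating the call with increasing `fraction` values would give a series of progressively sparser training sets on which a multitask model could be retrained and compared.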

    Photoacclimation strategies in northeastern Atlantic seagrasses: Integrating responses across plant organizational levels

    Seagrasses live in highly variable light environments and adjust to these variations by expressing acclimatory responses at different plant organizational levels (meadow, shoot, leaf and chloroplast level). Yet comparative studies that identify species' strategies, and that integrate the relative importance of photoacclimatory adjustments at different levels, are still missing. The variation in photoacclimatory responses at the chloroplast and leaf level was studied along individual leaves of Cymodocea nodosa, Zostera marina and Z. noltei, including measurements of variable chlorophyll fluorescence, photosynthesis, photoprotective capacities, non-photochemical quenching and D1-protein repair, and assessments of variation in leaf anatomy and chloroplast distribution. Our results show that the slower-growing C. nodosa expressed rather limited physiological and biochemical adjustments in response to light availability, while both species of faster-growing Zostera showed high variability along the leaves. In contrast, the inverse pattern was found for leaf anatomical adjustments in response to light availability, which were more pronounced in C. nodosa. This integrative plant organizational level approach shows that seagrasses differ in their photoacclimatory strategies and that these are linked to the species' life history strategies, information that will be critical for predicting the responses of seagrasses to disturbances and for developing adequate management strategies accordingly. Fundacao para a Ciencia e Tecnologia (FCT), Portugal [PTDC/MAR-EST/4257/2014

    Open Babel: An open chemical toolbox

    Background: A frequent problem in computational modeling is the interconversion of chemical structures between different formats. While standard interchange formats exist (for example, Chemical Markup Language) and de facto standards have arisen (for example, SMILES format), the need to interconvert formats is a continuing problem due to the multitude of different application areas for chemistry data, differences in the data stored by different formats (0D versus 3D, for example), and competition between software along with a lack of vendor-neutral formats. Results: We discuss, for the first time, Open Babel, an open-source chemical toolbox that speaks the many languages of chemical data. Open Babel version 2.3 interconverts over 110 formats. The need to represent such a wide variety of chemical and molecular data requires a library that implements a wide range of cheminformatics algorithms, from partial charge assignment and aromaticity detection, to bond order perception and canonicalization. We detail the implementation of Open Babel, describe key advances in the 2.3 release, and outline a variety of uses both in terms of software products and scientific research, including applications far beyond simple format interconversion. Conclusions: Open Babel presents a solution to the proliferation of multiple chemical file formats. In addition, it provides a variety of useful utilities from conformer searching and 2D depiction, to filtering, batch conversion, and substructure and similarity searching. For developers, it can be used as a programming library to handle chemical data in areas such as organic chemistry, drug design, materials science, and computational chemistry. It is freely available under an open-source license fro

    The distribution of standard deviations applied to high throughput screening

    High throughput screening (HTS) assesses compound libraries for “activity” using target assays. A subset of HTS data contains a large number of sample measurements replicated a small number of times, providing an opportunity to introduce the distribution of standard deviations (DSD). Applying the DSD to some HTS data sets revealed signs of bias in some of the data and uncovered a sub-population of compounds exhibiting high variability, which may be difficult to screen. In the data examined, 21% of 1189 such compounds were pan-assay interference compounds. This proportion reached 57% for the most closely related compounds within the sub-population. Using the DSD, large HTS data sets can be modelled in many cases as two distributions: a large group of nearly normally distributed “inactive” compounds and a residual distribution of “active” compounds. The latter were not normally distributed, overlapped the inactive distribution on both sides, and were larger than typically assumed. As such, a large number of compounds that could become the next generation of drugs are being misclassified as “inactive” or are invisible to current methods. Although applied here to HTS, the DSD is applicable to any data set with a large number of samples measured a small number of times.
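
The basic quantity behind the DSD, one standard deviation per compound computed from a few replicates, can be sketched as follows. This is only an illustration on simulated data, not the authors' procedure: the replicate count (3), the two noise levels, and the 95th-percentile cutoff for flagging "high variability" compounds are all assumptions, and `replicate_sds` is a hypothetical helper name.

```python
import random
import statistics


def replicate_sds(measurements):
    """Return the sample standard deviation of each compound's replicates."""
    return [statistics.stdev(reps) for reps in measurements]


rng = random.Random(1)

# Simulated data (assumption): 1000 well-behaved compounds and 50 noisy
# ones, each measured 3 times.
data = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(1000)]
data += [[rng.gauss(0.0, 5.0) for _ in range(3)] for _ in range(50)]

# The empirical distribution of these values is what a DSD analysis studies.
sds = replicate_sds(data)

# Arbitrary illustrative cutoff: flag compounds above the 95th percentile.
cutoff = statistics.quantiles(sds, n=100)[94]
high_var = [i for i, s in enumerate(sds) if s > cutoff]
```

A histogram of `sds` would show the bulk of compounds plus a heavier tail from the noisy group; a percentile cutoff is just one simple way to pick out that tail.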